Direct Reported Speech in Multilingual Texts: Automatic Annotation and Semantic Categorization
Abstract
We propose an application for the automatic identification and categorization of quotations. The categorization is based on a semantic map of enunciative modalities. Texts are processed in three languages: Arabic, Korean and French.

1. General presentation and related works

The automatic identification of quotations using natural language processing (NLP) has received growing attention in recent studies: (Mourad 2001), (Krestel, Bergler, and Witte 2008), InQuote¹, (Pouliquen, Steinberger, and Best 2008)², (Audebert, Gaubert, and Jaccarini 2009)³ and (De la Clergerie et al. 2009). In this study we propose an application for the automatic identification and categorization of quotations. This work differs from previous ones in several respects. First, our concern is neither to detect the source (holder) of the quotation nor to carry out its anaphoric analysis; rather, we aim to identify all forms of quotation in texts by taking their potential constructions into account. Second, drawing on the theory of enunciation, we aim to categorize quotations automatically according to various semantic criteria (commitment, opinion, judgment...) in a multilingual context (Arabic, French and Korean). Finally, the tool we use for automatic annotation, EXCOM⁴, is a rule-based system that performs no morpho-syntactic analysis or named-entity recognition (Alrahabi and Desclés 2009b). EXCOM implements the Contextual Exploration method (Desclés 2006) and performs the annotations automatically using the surface forms of certain linguistic markers.

In the following sections, we first present the linguistic analysis of quotations and then explain how the linguistic markers can be organized in a semantic map. We conclude with the results of the evaluation and the perspectives.

1 http://labs.google.com/inquotes/
2 http://press.jrc.it/NewsExplorer/home/fr/latest.html
3 http://www.ifao.egnet.net/kawakib
4 http://www.excom.fr/

2. Quotation analysis

First, let us introduce an important distinction between “utterer” (énonciateur) and “speaker” (locuteur). The utterer is the entity that reports the speech, whereas the speaker is the source (holder) of the speech. On the formal level, we consider a quotation to be any stretch of speech delimited by meta-characters (the typographical signs of quotation) and introduced by at least one linguistic marker referring to an act of speaking, whether or not the speaker is explicitly identified. We take into consideration any form of direct reported speech that satisfies these conditions, i.e. the canonical forms as well as hybrid or mixed forms (such as the direct style introduced by “that”, see (Tuomarla 2000)).

In general, we consider that an utterer can report a speaker's discourse in at least three ways:

• By attributing to a speaker an implicit act of locution (Pour X [As for X] / اليكم هذا الخبر [Here is this news...] / [According to X]). This reflects the distance that the utterer takes in relation to the reported content.
• By attributing to a speaker a speech as an act of “hearing” (Je me suis laissé entendre [It was intimated to me...] / بلغنا ما يلي [This news has reached us] / [heard from X]). This often indicates the spread of information (or rumors).
• By attributing to a speaker an explicit act of locution (X a décidé [X decided] / أعلن فلان [X declared] /

5 In Korean (Pak et al. 2009), a set of linguistic markers following the quotation marks often indicates a real quotation (lako, lakoto, ko, koto, ilako, etc.).
6 Examples in this paper are not identical from one language to another, but they belong to the same semantic categories.

Proceedings of the Twenty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS 2010)
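The formal definition given in the quotation analysis (a segment delimited by typographical quotation signs and introduced by at least one marker of an act of speaking) can be illustrated with a minimal Python sketch. This is not EXCOM's rule set or its implementation: the marker lists, the category labels, the `window` parameter and the function names are all hypothetical, and the lists only reuse a few French surface forms and English glosses quoted in the text. The sketch does share one property with the approach described here: it works on surface forms only, with no morpho-syntactic analysis.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical, deliberately tiny marker lists (assumption: the real
# resources are far larger and language-specific).
MARKERS = {
    "explicit_locution": ["a décidé", "declared", "decided"],
    "hearing": ["je me suis laissé entendre", "heard from"],
    "implicit_locution": ["pour", "as for", "according to"],
}

# Typographical delimiters: French guillemets, curly quotes, straight quotes.
QUOTE_RE = re.compile(r'«\s*(.+?)\s*»|“(.+?)”|"(.+?)"')


@dataclass
class Quotation:
    text: str      # the quoted segment, without its delimiters
    category: str  # one of the keys of MARKERS
    marker: str    # the surface form that licensed the annotation


def _first_marker(context: str) -> Optional[Tuple[str, str]]:
    """Return (category, marker) for the first marker found in context."""
    for category, forms in MARKERS.items():
        for form in forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, context, flags=re.IGNORECASE):
                return category, form
    return None


def find_quotations(text: str, window: int = 40) -> List[Quotation]:
    """Keep a quoted segment only if a speech marker occurs near it."""
    results = []
    for match in QUOTE_RE.finditer(text):
        quoted = next(g for g in match.groups() if g is not None)
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        hit = _first_marker(left + " " + right)
        if hit is not None:
            results.append(Quotation(quoted, hit[0], hit[1]))
    return results


if __name__ == "__main__":
    sample = "Le ministre a décidé : « Nous partons demain »."
    for quotation in find_quotations(sample):
        print(quotation)
```

A segment in quotation marks with no nearby marker (for instance a word quoted for emphasis) is left unannotated, which mirrors the requirement that both conditions hold. The fixed character window is a simplification chosen for brevity; a rule-based system would more plausibly search within sentence or clause boundaries.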
Similar resources
Automatic Annotation of Direct Reported Speech in Arabic and French, According to a Semantic Map of Enunciative Modalities
We present an analysis of the linguistic markers of the enunciative modalities in direct reported speech, in a multilingual framework concerning Arabic and French. Furthermore, we present a platform for automatic annotation of semantic relations, based on the Contextual Exploration method. This platform allows the automatic annotation and categorisation of quotational segments in both languages...
Accessing Multilingual Data on the Web for the Semantic Annotation of Cultural Heritage Texts
Our study targets interoperable semantic annotation of Cultural Heritage or eHumanities texts in German and Hungarian. A semantic resource we focus on is the Thompson Motif-index of folk-literature (TMI), the labels of which are available only in English. We investigate the use of lexical data on the Web in German and Hungarian for supporting semi-automatic translation of TMI: lexical resources of...
BioExcom: Automatic Annotation and categorization of speculative sentences in biological literature by a Contextual Exploratio
Biological research papers are replete with speculative sentences. This paper presents the BioExcom software, an adaptation of the EXCOM platform to the field of biology, which automatically annotates all speculative sentences in full-text papers by means of Contextual Exploration processing. This annotation process is based on a concise semantic analysis of the multiple ways of expressing specul...
Automatic annotation of multilingual text collections with a conceptual thesaurus
Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore makes it possible to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper presents an almost language-independent system that...
Multilingual Ontology Enrichment for Semantic Annotation and Retrieval of Medical Information
Background: Knowledge management in the European project Noesis addresses concept-based annotation and multilingual Information Retrieval of documents. Objective: Multilingual enrichment of a concept-based terminology in the medical field. Experience and evaluation in the domain of cardiovascular diseases by enriching a subset of the MeSH thesaurus in six European languages. This terminology, r...